Abstract

This paper argues that the half-Cauchy distribution should replace the inverse- 
Gamma distribution as a default prior for a top-level scale parameter in Bayesian hi- 
erarchical models, at least for cases where a proper prior is necessary. Our arguments 
involve a blend of Bayesian and frequentist reasoning, and are intended to complement
the original case made by Gelman (2006) in support of the folded-t family of
priors. First, we generalize the half-Cauchy prior to the wider class of hypergeometric
inverted-beta priors. We derive expressions for posterior moments and marginal
densities when these priors are used for a top-level normal variance in a Bayesian hi- 
erarchical model. We go on to prove a proposition that, together with the results for 
moments and marginals, allows us to characterize the frequentist risk of the Bayes 
estimators under all global-shrinkage priors in the class. These theoretical results, in 
turn, allow us to study the frequentist properties of the half-Cauchy prior versus a wide 
class of alternatives. The half-Cauchy occupies a sensible "middle ground" within this 
class: it performs very well near the origin, but does not lead to drastic compromises 
in other parts of the parameter space. This provides an alternative, classical justifica- 
tion for the repeated, routine use of this prior. We also consider situations where the 
underlying mean vector is sparse, where we argue that the usual conjugate choice of an 
inverse-gamma prior is particularly inappropriate, and can lead to highly distorted pos- 
terior inferences. Finally, we briefly summarize some open issues in the specification 
of default priors for scale terms in hierarchical models. 
1 Introduction

Consider a normal hierarchical model where, for $i = 1, \ldots, p$,

$$
\begin{aligned}
(y_i \mid \beta_i, \sigma^2) &\sim \mathrm{N}(\beta_i, \sigma^2) \\
(\beta_i \mid \lambda^2, \sigma^2) &\sim \mathrm{N}(0, \lambda^2 \sigma^2) \\
\lambda^2 &\sim g(\lambda^2) .
\end{aligned}
$$

This prototype case embodies a very general problem in Bayesian inference: how to choose
default priors for top-level variances (here $\lambda^2$ and $\sigma^2$) in a hierarchical model.

The routine use of Jeffreys' prior for the error variance, $p(\sigma^2) \propto \sigma^{-2}$, poses no practical
issues. This is not the case for $p(\lambda^2)$, however, as the improper prior $p(\lambda^2) \propto \lambda^{-2}$ leads to
an improper posterior. This can be seen from the marginal likelihood:

$$
p(y \mid \lambda^2) \propto \prod_{i=1}^{p} (1+\lambda^2)^{-1/2} \exp\left\{ -\frac{y_i^2}{2(1+\lambda^2)} \right\} ,
$$

where we have taken $\sigma^2 = 1$ for convenience. This is positive at $\lambda^2 = 0$; therefore, whenever
the prior $p(\lambda^2)$ fails to be integrable at the origin, so too will the posterior. A number of
default choices have been proposed to overcome this issue. A classic reference is Tiao and
Tan (1965); a very recent one is Morris and Tang (2011), who use a flat prior $p(\lambda^2) \propto 1$.
We focus on a proposal by Gelman (2006), who studies the class of half-t priors for the
scale parameter $\lambda$:

$$
p(\lambda \mid d) \propto \left( 1 + \frac{\lambda^2}{d} \right)^{-(d+1)/2}
$$

for some degrees-of-freedom parameter $d$. The half-t prior has the appealing property that
its density evaluates to a nonzero constant at $\lambda = 0$. This distinguishes it from the usual
conjugate choice of an inverse-gamma prior for $\lambda^2$, whose density vanishes at $\lambda = 0$. As
Gelman (2006) points out, posterior inference under these priors is no more difficult than it
is under an inverse-gamma prior, using the simple trick of parameter expansion.

These facts lead to a simple, compelling argument against the use of the inverse-gamma
prior for variance terms in models such as that above. Since the marginal likelihood of the
data, considered as a function of $\lambda$, does not vanish when $\lambda = 0$, neither should the prior
density $p(\lambda)$. Otherwise, the posterior distribution for $\lambda$ will be inappropriately biased
away from zero. This bias, moreover, is most severe near the origin, precisely in the region
of parameter space where the benefits of shrinkage become most pronounced.
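The argument above can be checked numerically. The following is a minimal sketch in Python (standard library only), not code from the paper: the function names are hypothetical, and the inverse-gamma density is parameterized by a shape and a rate-type parameter, which is an assumption about convention. It evaluates the log marginal likelihood with $\sigma^2 = 1$, the unnormalized half-t density, and the density on $\lambda$ induced by an inverse-gamma prior on $\lambda^2$.

```python
import math


def log_marginal_likelihood(y, lam2):
    """log p(y | lambda^2) up to an additive constant, taking sigma^2 = 1."""
    v = 1.0 + lam2
    return sum(-0.5 * math.log(v) - 0.5 * yi * yi / v for yi in y)


def half_t_density_unnorm(lam, d):
    """Unnormalized half-t density (1 + lam^2 / d)^(-(d + 1) / 2) for lam >= 0."""
    return (1.0 + lam * lam / d) ** (-(d + 1.0) / 2.0)


def inv_gamma_induced_density(lam, shape, rate):
    """Density on lam induced by lam^2 ~ IG(shape, rate).

    The change of variables from lam^2 to lam contributes the Jacobian 2 * lam.
    The value at lam = 0 is the limit of the density, which is zero.
    """
    if lam == 0.0:
        return 0.0
    lam2 = lam * lam
    log_p = (shape * math.log(rate) - math.lgamma(shape)
             - (shape + 1.0) * math.log(lam2) - rate / lam2)
    return 2.0 * lam * math.exp(log_p)
```

Evaluating `log_marginal_likelihood(y, 0.0)` returns a finite value for any data vector, whereas `inv_gamma_induced_density` is essentially zero for small `lam`; the half-t density, by contrast, equals its maximum at `lam = 0`.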

This paper studies the special case of a half-Cauchy prior for $\lambda$ with three goals in
mind. First, we embed it in the wider class of hypergeometric inverted-beta priors for $\lambda^2$,
and derive expressions for the resulting posterior moments and marginal densities. Second,
we derive expressions for the classical risk of Bayes estimators arising from this class of
priors. In particular, we prove a result that allows us to characterize the improvements in
risk near the origin ($\|\beta\| \approx 0$) that are possible using the wider class. Having proven our risk
results for all members of this wider class, we then return to the special case of the
half-Cauchy; we find that the frequentist risk profile of the resulting Bayes estimator is quite
favorable, and rather similar to that of the positive-part James-Stein estimator. Therefore
Bayesians can be comfortable using the prior on purely frequentist grounds.

Third, we attempt to provide some insight about the use of such priors in situations
where $\beta$ is expected to be sparse. We find that the arguments of Gelman (2006) in favor of
the half-Cauchy are, if anything, amplified in the presence of sparsity, and that the
inverse-gamma prior can have an especially distorting effect on posterior inference for sparse
signals.

Overall, our results provide a complementary set of arguments in addition to those of
Gelman (2006) that support the routine use of the half-Cauchy prior: its excellent
(frequentist) risk properties, and its sensible behavior in the presence of sparsity compared to the
usual conjugate alternative. Bringing all these arguments together, we contend that the
half-Cauchy prior is a sensible default choice for a top-level variance in Gaussian hierarchical
models. We echo the call for it to replace inverse-gamma priors in routine use, particularly
given the availability of a simple parameter-expanded Gibbs sampler for posterior
computation.



2 Inverted-beta priors and their generalizations 



Consider the family of inverted-beta priors for $\lambda^2$:

$$
p(\lambda^2) = \frac{(\lambda^2)^{b-1} (1+\lambda^2)^{-(a+b)}}{\mathrm{Be}(a,b)} ,
$$

where $\mathrm{Be}(a,b)$ denotes the beta function, and where $a$ and $b$ are positive reals. A
half-Cauchy prior for $\lambda$ corresponds to an inverted-beta prior for $\lambda^2$ with $a = b = 1/2$. This
family also generalizes the robust priors of Strawderman (1971) and Berger (1980); the
normal-exponential-gamma prior of Griffin and Brown (2005); and the horseshoe prior of
Carvalho et al. (2010). The inverted-beta distribution is also known as the beta-prime or
Pearson Type VI distribution. An inverted-beta random variable is equal in distribution
to the ratio of two gamma-distributed random variables having shape parameters $a$ and $b$,
respectively, along with a common scale parameter.
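The correspondence between the half-Cauchy prior on $\lambda$ and the inverted-beta prior on $\lambda^2$ with $a = b = 1/2$ follows from a change of variables. The short Python sketch below (standard library only; the function names are illustrative and not from the paper) evaluates both densities so that they can be compared pointwise.

```python
import math


def beta_fn(a, b):
    """Beta function Be(a, b), computed through log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def inverted_beta_pdf(lam2, a, b):
    """Inverted-beta density of lambda^2: (lam2)^(b-1) (1+lam2)^(-(a+b)) / Be(a,b)."""
    return lam2 ** (b - 1.0) * (1.0 + lam2) ** (-(a + b)) / beta_fn(a, b)


def induced_pdf_on_lambda(lam, a, b):
    """Density of lambda = sqrt(lambda^2); the Jacobian of the map is 2 * lam."""
    return 2.0 * lam * inverted_beta_pdf(lam * lam, a, b)


def half_cauchy_pdf(lam):
    """Standard half-Cauchy density on lam >= 0."""
    return 2.0 / (math.pi * (1.0 + lam * lam))
```

With $a = b = 1/2$ the two functions `induced_pdf_on_lambda` and `half_cauchy_pdf` agree, because $\mathrm{Be}(1/2, 1/2) = \pi$.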

The inverted-beta family is itself a special case of a new, wider class of hypergeometric
inverted-beta distributions having the following probability density function:

$$
p(\lambda^2) = C^{-1} (\lambda^2)^{b-1} (\lambda^2 + 1)^{-(a+b)} \exp\left\{ -\frac{s}{1+\lambda^2} \right\} \left\{ \frac{1}{\tau^2} + \frac{1 - 1/\tau^2}{1+\lambda^2} \right\}^{-1} \qquad (1)
$$

for $a > 0$, $b > 0$, $\tau^2 > 0$, and $s \in \mathbb{R}$. This comprises a wide class of priors leading to
posterior moments and marginals that can be expressed using confluent hypergeometric
functions. In Appendix A we give details of these computations, which yield



$$
C = e^{-s} \, \mathrm{Be}(a,b) \, \Phi_1(b, 1, a+b, s, 1 - 1/\tau^2) , \qquad (2)
$$

where $\Phi_1$ is the degenerate hypergeometric function of two variables (Gradshteyn and
Ryzhik 1965, 9.261). This function can be calculated accurately and rapidly by
transforming it into a convergent series of ${}_2F_1$ functions (§9.2 of Gradshteyn and Ryzhik 1965;
Gordy 1998), making evaluation of $\Phi_1$ quite fast for most choices of the parameters.

Both $\tau$ and $s$ are global scale parameters, and do not control the behavior of $p(\lambda)$ at
$0$ or $\infty$. The parameters $a$ and $b$ are analogous to those of the beta distribution. Smaller values
of $a$ encourage heavier tails in $p(\beta)$, with $a = 1/2$, for example, yielding Cauchy-like tails.
Smaller values of $b$ encourage $p(\beta)$ to have more mass near the origin, and eventually to
become unbounded; $b = 1/2$ yields, for example, $p(\beta) \approx \log(1 + 1/\beta^2)$ near 0.

We now derive expressions for the moments of $p(\beta \mid y, \sigma^2)$ and the marginal likelihood
$p(y \mid \sigma^2)$ for priors in this family. As a special case, we easily obtain the posterior mean for
$\beta$ under a half-Cauchy prior on $\lambda$.

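Given $\lambda^2$ and $\sigma^2$, the posterior distribution of $\beta$ is multivariate normal, with mean $m$
and variance $V$ given by

$$
m = \left( 1 - \frac{1}{1+\lambda^2} \right) y , \qquad V = \left( 1 - \frac{1}{1+\lambda^2} \right) \sigma^2 .
$$

Define $\kappa = 1/(1+\lambda^2)$. By Fubini's theorem, the posterior mean and variance of $\beta$ are

$$
\begin{aligned}
E(\beta \mid y, \sigma^2) &= \{1 - E(\kappa \mid y, \sigma^2)\} \, y \qquad &(3) \\
\mathrm{var}(\beta \mid y, \sigma^2) &= \{1 - E(\kappa \mid y, \sigma^2)\} \, \sigma^2 , \qquad &(4)
\end{aligned}
$$

now conditioning only on $\sigma^2$.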

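It is most convenient to work with $p(\kappa)$ instead:

$$
p(\kappa) \propto \kappa^{a-1} (1-\kappa)^{b-1} \left\{ \frac{1}{\tau^2} + \left( 1 - \frac{1}{\tau^2} \right) \kappa \right\}^{-1} e^{-\kappa s} . \qquad (5)
$$

The joint density for $\kappa$ and $y$ takes the same functional form:

$$
p(y_1, \ldots, y_p, \kappa) \propto \kappa^{a'-1} (1-\kappa)^{b-1} \left\{ \frac{1}{\tau^2} + \left( 1 - \frac{1}{\tau^2} \right) \kappa \right\}^{-1} e^{-\kappa s'} ,
$$

with $a' = a + p/2$, and $s' = s + Z/(2\sigma^2)$ for $Z = \sum_{i=1}^{p} y_i^2$. Hence the posterior for $\lambda^2$ is also a
hypergeometric inverted-beta distribution, with parameters $(a', b, \tau^2, s')$.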
Next, the moment-generating function of (5) is easily shown to be

$$
M(t) = e^{t} \, \frac{\Phi_1(b, 1, a+b, s-t, 1 - 1/\tau^2)}{\Phi_1(b, 1, a+b, s, 1 - 1/\tau^2)} .
$$

See, for example, Gordy (1998). Expanding $\Phi_1$ as a sum of ${}_1F_1$ functions and using the
differentiation rules given in Chapter 15 of Abramowitz and Stegun (1964) yields

$$
E(\kappa^n \mid y, \sigma^2) = \frac{(a')_n}{(a'+b)_n} \, \frac{\Phi_1(b, 1, a'+b+n, s', 1 - 1/\tau^2)}{\Phi_1(b, 1, a'+b, s', 1 - 1/\tau^2)} . \qquad (6)
$$

Combining the above expression with (3) and (4) yields the conditional posterior mean
and variance for $\beta$, given $y$ and $\sigma^2$. Similarly, the marginal density $p(y \mid \sigma^2)$ is a simple
expression involving the ratio of prior to posterior normalizing constants:

$$
p(y \mid \sigma^2) = (2\pi\sigma^2)^{-p/2} \exp\left( -\frac{Z}{2\sigma^2} \right) \frac{\mathrm{Be}(a', b)}{\mathrm{Be}(a, b)} \, \frac{\Phi_1(b, 1, a'+b, s', 1 - 1/\tau^2)}{\Phi_1(b, 1, a+b, s, 1 - 1/\tau^2)} .
$$
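For the half-Cauchy special case ($a = b = 1/2$, $\tau = 1$, $s = 0$) the posterior moments of $\kappa$ can also be obtained by one-dimensional quadrature, which gives a useful cross-check on the closed-form expressions. The sketch below is an illustration rather than the paper's implementation: it swaps the $\Phi_1$ series for Simpson's rule after the substitution $\kappa = \sin^2\theta$, which removes the endpoint singularities of the prior. All function names are hypothetical.

```python
import math


def simpson(f, lo, hi, n=400):
    """Composite Simpson rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(lo + k * h)
    return total * h / 3.0


def posterior_kappa_moment(y, n=1, sigma2=1.0):
    """E(kappa^n | y, sigma^2) under the half-Cauchy prior (a = b = 1/2, tau = 1, s = 0).

    With kappa = sin^2(theta), the posterior kernel
    kappa^(p/2 - 1/2) (1 - kappa)^(-1/2) exp(-kappa Z / (2 sigma^2)) d kappa
    becomes 2 sin(theta)^p exp(-Z sin^2(theta) / (2 sigma^2)) d theta.
    """
    p = len(y)
    z = sum(v * v for v in y) / sigma2

    def kernel(theta, power):
        s2 = math.sin(theta) ** 2
        return s2 ** (power / 2.0) * math.exp(-0.5 * z * s2)

    num = simpson(lambda t: kernel(t, p + 2 * n), 0.0, math.pi / 2.0)
    den = simpson(lambda t: kernel(t, p), 0.0, math.pi / 2.0)
    return num / den


def posterior_mean_beta(y, sigma2=1.0):
    """Posterior mean {1 - E(kappa | y, sigma^2)} y from equation (3)."""
    shrink = posterior_kappa_moment(y, 1, sigma2)
    return [(1.0 - shrink) * v for v in y]
```

When all observations are zero the exponential term drops out and the first moment reduces to $a'/(a'+b) = (p+1)/(p+2)$, in agreement with (6).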
Figure 1: Ten true means drawn from a standard normal distribution; data from these means
under standard normal noise; shrinkage estimates under the half-Cauchy prior for $\lambda$.



3 Classical risk results 

These priors are useful in situations where standard priors like the inverse-gamma or Jef- 
freys' are inappropriate or ill-behaved. Non-Bayesians will find them useful for generating 
easily computable shrinkage estimators that have known risk properties. Bayesians will 
find them useful for generating computationally tractable priors for a variance parameter. 
We argue that these complementary but overlapping goals can both be satisfied for the spe- 
cial case of the half-Cauchy. To show this, we first characterize the risk properties of the 
Bayes estimators that result from the wider family of priors used for a normal mean under 
a quadratic loss. Our analysis shows that: 

1. The hypergeometric-beta family provides a large class of Bayes estimators that will
perform no worse than the MLE in the tails, i.e. when $\|\beta\|^2$ is large.

2. Major improvements over the James-Stein estimator are possible near the origin.
This can be done in several ways: by choosing $a$ large relative to $b$, by choosing $a$
and $b$ both less than 1, by choosing $s$ negative, or by choosing $\tau < 1$. Each of these
choices involves a compromise somewhere else in the parameter space.

We now derive expressions for the classical risk, as a function of $\|\beta\|$, for the resulting
Bayes estimators under hypergeometric inverted-beta priors. Assume without loss of
generality that $\sigma^2 = 1$, and let $p(y) = \int p(y \mid \beta) p(\beta) \, d\beta$ denote the marginal density of the data.
Following Stein (1981), write the mean-squared error of the posterior mean $\hat{\beta}$ as

$$
E(\|\hat{\beta} - \beta\|^2) = p + E_{y \mid \beta} \left( \|g(y)\|^2 + 2 \sum_{i=1}^{p} \frac{\partial}{\partial y_i} g_i(y) \right) ,
$$

where $g(y) = \nabla \log p(y)$. In turn this can be written as

$$
E(\|\hat{\beta} - \beta\|^2) = p + 4 E_{y \mid \beta} \left( \frac{\Delta \sqrt{p(y)}}{\sqrt{p(y)}} \right) .
$$

Figure 2 (panels: p = 7 and p = 15): Mean-squared error as a function of $\|\beta\|$ for $p = 7$ and $p = 15$. Solid line:
James-Stein estimator. Dotted line: Bayes estimator under a half-Cauchy prior for $\lambda$.



We now state our main result concerning computation of this quantity. 

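Proposition 1. Suppose that $\beta \sim \mathrm{N}_p(0, \lambda^2 I)$, that $\kappa = 1/(1+\lambda^2)$, and that the prior $p(\kappa)$
is such that $\lim_{\kappa \to 0, 1} \kappa (1-\kappa) p(\kappa) = 0$. Define

$$
m_p(Z) = \int_0^1 \kappa^{p/2} e^{-\frac{Z}{2} \kappa} p(\kappa) \, d\kappa
$$

for $Z = \sum_{i=1}^{p} y_i^2$. Then as a function of $\beta$, the quadratic risk of the posterior mean under
$p(\kappa)$ is

$$
E(\|\hat{\beta} - \beta\|^2) = p + 2 E_{Z \mid \beta} \left\{ Z \frac{m_{p+4}(Z)}{m_p(Z)} - p \, g(Z) - \frac{Z}{2} g(Z)^2 \right\} , \qquad (7)
$$

where $g(Z) = E(\kappa \mid Z)$, and where

$$
Z \frac{m_{p+4}(Z)}{m_p(Z)} = (p + Z + 4) \, g(Z) - (p+2) - E_{\kappa \mid Z} \left\{ 2 \kappa (1-\kappa) \frac{p'(\kappa)}{p(\kappa)} \right\} . \qquad (8)
$$

Proof. See Appendix B. □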
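As a sanity check on identity (8), consider the half-Cauchy case, where $p(\kappa) \propto \kappa^{-1/2} (1-\kappa)^{-1/2}$ satisfies the boundary condition of the proposition and $2\kappa(1-\kappa) p'(\kappa)/p(\kappa) = 2\kappa - 1$. The right-hand side of (8) then becomes $(p + Z + 4) g(Z) - (p+2) - \{2 g(Z) - 1\}$. The Python sketch below compares this with the left-hand side computed by quadrature. It is an illustrative check under these assumptions, with hypothetical function names, and it uses Simpson's rule after the substitution $\kappa = \sin^2\theta$ rather than any method described in the paper.

```python
import math


def half_cauchy_m_ratio(p, z, shift, n=2000):
    """Return m_{p+shift}(Z) / m_p(Z) for p(kappa) proportional to kappa^(-1/2) (1-kappa)^(-1/2).

    After kappa = sin^2(theta), m_k(Z) is proportional to the integral over [0, pi/2]
    of sin(theta)^k * exp(-Z sin^2(theta) / 2), evaluated here by Simpson's rule.
    """
    h = (math.pi / 2.0) / n
    num = den = 0.0
    for k in range(n + 1):
        weight = 1.0 if k in (0, n) else (4.0 if k % 2 else 2.0)
        s2 = math.sin(k * h) ** 2
        base = weight * s2 ** (p / 2.0) * math.exp(-0.5 * z * s2)
        den += base
        num += base * s2 ** (shift / 2.0)
    return num / den


def identity_8_sides(p, z):
    """Left- and right-hand sides of (8) for the half-Cauchy prior."""
    g = half_cauchy_m_ratio(p, z, 2)           # g(Z) = E(kappa | Z)
    lhs = z * half_cauchy_m_ratio(p, z, 4)     # Z m_{p+4}(Z) / m_p(Z)
    rhs = (p + z + 4.0) * g - (p + 2.0) - (2.0 * g - 1.0)
    return lhs, rhs
```

At $Z = 0$ both sides vanish, since $g(0) = (p+1)/(p+2)$; for $Z > 0$ the two sides returned by `identity_8_sides` should agree to quadrature accuracy.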

Proposition 1 is useful because it characterizes the risk in terms of two known quantities:
the integral $m_p(Z)$, and the posterior expectation $g(Z) = E(\kappa \mid Z)$. Using the results of the
previous section, these are easily obtained under a hypergeometric inverted-beta prior for
$\lambda^2$. Furthermore, given $\|\beta\|$, $Z = U^2 + V$ in distribution, where

$$
\begin{aligned}
U &\sim \mathrm{N}(\|\beta\|, 1) \\
V &\sim \chi^2_{p-1} .
\end{aligned}
$$

The risk of the Bayes estimator is therefore easy to evaluate as a function of $\|\beta\|^2$. These
expressions can be compared to those of, for example, George et al. (2006), who consider
Kullback-Leibler predictive risk for similar priors.
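One way to evaluate the risk in practice is to average the integrand of (7) over Monte Carlo draws of $Z = U^2 + V$. The following Python sketch does this for the half-Cauchy prior. It is a rough illustration, not the computation used to produce the paper's figures: the ratios $m_{p+2}/m_p$ and $m_{p+4}/m_p$ are obtained by Simpson's rule after the substitution $\kappa = \sin^2\theta$ instead of through $\Phi_1$, and the function names, number of draws, and seed are arbitrary assumptions.

```python
import math
import random


def kappa_moments(p, z, n=200):
    """Return (m_{p+2}/m_p, m_{p+4}/m_p) at Z = z under the half-Cauchy prior."""
    h = (math.pi / 2.0) / n
    i0 = i2 = i4 = 0.0
    for k in range(n + 1):
        weight = 1.0 if k in (0, n) else (4.0 if k % 2 else 2.0)
        s2 = math.sin(k * h) ** 2
        base = weight * s2 ** (p / 2.0) * math.exp(-0.5 * z * s2)
        i0 += base
        i2 += base * s2
        i4 += base * s2 * s2
    return i2 / i0, i4 / i0


def bayes_risk_half_cauchy(p, beta_norm, draws=1000, seed=0):
    """Monte Carlo estimate of the risk in (7) at a given value of ||beta||."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        u = rng.gauss(beta_norm, 1.0)                           # U ~ N(||beta||, 1)
        v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p - 1)) # V ~ chi^2_{p-1}
        z = u * u + v
        g, r = kappa_moments(p, z)
        total += z * r - p * g - 0.5 * z * g * g
    return p + 2.0 * total / draws
```

For example, `bayes_risk_half_cauchy(7, 0.0)` estimates the risk at the origin and can be compared with the constant risk $p$ of the MLE; calling the function over a grid of `beta_norm` values traces out a risk curve.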

Our interest is in the special case $a = b = 1/2$, $\tau = 1$, and $s = 0$, corresponding to a
half-Cauchy prior for the global scale $\lambda$. Figure 2 shows the classical risk of the Bayes estimator
under this prior for $p = 7$ and $p = 15$. The risk of the James-Stein estimator is shown for
comparison. These pictures look similar for other values of $p$, and show overall that the
half-Cauchy prior for $\lambda$ leads to a Bayes estimator that is competitive with the James-Stein
estimator.

Figure 1 shows a simple example of the posterior mean under the half-Cauchy prior for
$\lambda$ when $p = 10$, calculated for fixed $\sigma$ using the results of the previous section. For this
particular value of $\beta$ the expected squared-error risk of the MLE is 10, and the expected
squared-error risk of the half-Cauchy posterior mean is 8.6.

A natural question is: of all the hypergeometric inverted-beta priors, why choose the 
half-Cauchy? There is no iron-clad reason to do so, of course, and we can imagine many 
situations where subjective information would support a different choice. But in examining 
many other members of the class, we have observed that the half-Cauchy seems to occupy 
a sensible "middle ground" in terms of frequentist risk. To study this, we are able to appeal 
to the theory of the previous section. See, for example, Figure 3, which compares several
members of the class for the case p = 7. Observe that large gains over James-Stein near the 
origin are possible, but only at the expense of minimaxity. The half-Cauchy, meanwhile, 
still improves upon the James-Stein estimator near the origin, but does not sacrifice good 
risk performance in other parts of the parameter space. From a purely classical perspective, 
it looks like a sensible default choice, suitable for repeated general use. 



4 Global scale parameters in local-shrinkage models 

A now-canonical modification of the basic hierarchical model from the introduction
involves the use of local shrinkage parameters:

$$
\begin{aligned}
(y_i \mid \beta_i, \sigma^2) &\sim \mathrm{N}(\beta_i, \sigma^2) \\
(\beta_i \mid \lambda^2, u_i^2, \sigma^2) &\sim \mathrm{N}(0, \lambda^2 \sigma^2 u_i^2) \\
u_i^2 &\sim p(u_i^2) \\
\lambda^2 &\sim g(\lambda^2) .
\end{aligned}
$$

Mixing over $u_i$ leads to a non-Gaussian marginal for $\beta_i$. For example, choosing an
exponential prior for each $u_i^2$ results in a Laplace (lasso-type) prior. This class of models provides
a Bayesian alternative to penalized-likelihood estimation. When the underlying vector of
means is sparse, these global-local shrinkage models can lead to large improvements in
both estimation and prediction compared with pure global shrinkage rules. There is a large
literature on the choice of $p(u_i^2)$, with Polson and Scott (2011a,b) providing a recent review.
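The Laplace example can be illustrated by simulation. The sketch below (Python, standard library only; the function name, default rate, and seed are illustrative assumptions) draws $u^2$ from an exponential distribution and then $\beta \mid u^2 \sim \mathrm{N}(0, u^2)$, with the global scale and $\sigma$ both fixed at 1 for simplicity. With rate $1/2$ the marginal of $\beta$ is a Laplace distribution with unit scale, so that $E|\beta| = 1$ and $P(|\beta| > 2) = e^{-2}$.

```python
import math
import random


def sample_laplace_via_mixture(n, rate=0.5, seed=1):
    """Draw n values of beta with beta | u2 ~ N(0, u2) and u2 ~ Exponential(rate).

    The marginal of beta is Laplace with scale 1 / sqrt(2 * rate).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        u2 = rng.expovariate(rate)          # local variance u^2
        draws.append(rng.gauss(0.0, math.sqrt(u2)))
    return draws
```

Comparing the empirical mean of $|\beta|$ and the empirical tail frequency against these Laplace values gives a quick check that the scale mixture behaves as described.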



As many authors have documented, strong global shrinkage combined with heavy-tailed
local shrinkage is why these sparse Bayes estimators work so well at sifting signals from
noise. Intuitively, the idea is that $\lambda$ acts as a global parameter that adapts to the underlying
sparsity of the signal. When few signals are present, it is quite common for the marginal
likelihood of $y$ as a function of $\lambda$ to concentrate near 0 ("shrink globally"), and for the
signals to be flagged via very large values of the local shrinkage parameters $u_i^2$ ("act locally").
Indeed, in some cases the marginal maximum-likelihood solution can be the degenerate
$\hat{\lambda} = 0$ (see, for example, Tiao and Tan 1965).

Figure 3: Mean-squared error as a function of $\|\beta\|$ for $p = 7$ and various cases of the
hypergeometric inverted-beta hyperparameters.

Figure 4: The black line shows the marginal likelihood of the data as a function of $\lambda$
under a horseshoe prior for each $\beta_i$. (The likelihood has been renormalized to obtain a
maximum of 1.) The blue and red lines show two priors for $\lambda$: the half-Cauchy, and that
induced by an inverse-gamma prior on $\lambda^2$.



The classical risk results of the previous section no longer apply to a model with these
extra local-shrinkage parameters, since the marginal distribution of $\beta$, given $\lambda$, is not
multivariate normal. Nonetheless, the case of sparsity serves only to amplify the purely Bayesian
argument in favor of the half-Cauchy prior for a global scale parameter: namely, the
argument that $p(\lambda \mid y)$ should not be artificially pulled away from zero by an inverse-gamma
prior.

Figure 4 vividly demonstrates this point. We simulated data from a sparse model where
$\beta$ contained the entries $(5, 4, 3, 2, 1)$ along with 45 zeroes, and where $y_{ij} \sim \mathrm{N}(\beta_i, 1)$ for $j =
1, 2, 3$. We then used Markov Chain Monte Carlo to compute the marginal likelihood of
the data as a function of $\lambda$, assuming that each $\beta_i$ has a horseshoe prior (Carvalho et al.
2010). This can be approximated by assuming a flat prior for $\lambda$ (here truncated above at 10),
and then computing the conditional likelihood $p(y \mid \lambda, \sigma, u_1, \ldots, u_p)$ over a discrete grid of
$\lambda$ values at each step of the Markov chain. The marginal likelihood function can then be
approximated as the pointwise average of the conditional likelihood over the samples from
the joint posterior.
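The averaging step described above can be sketched as follows. This Python fragment is only an outline under stated assumptions: it takes as given a list of posterior draws of $(\sigma, u_1, \ldots, u_p)$ from some sampler that is not shown, it works with the group means $\bar{y}_i$ of `n_rep` replicates with $\beta_i$ integrated out (so that $\bar{y}_i \sim \mathrm{N}(0, \sigma^2 (1/n + \lambda^2 u_i^2))$ under the model of this section), and it drops factors that do not depend on $\lambda$. The function names are hypothetical.

```python
import math


def conditional_log_lik(ybar, lam, sigma, u, n_rep=1):
    """log p(ybar | lam, sigma, u), up to terms free of lam, with beta integrated out."""
    total = 0.0
    for yb, ui in zip(ybar, u):
        var = sigma ** 2 * (1.0 / n_rep + lam ** 2 * ui ** 2)
        total += -0.5 * math.log(var) - 0.5 * yb * yb / var
    return total


def marginal_likelihood_on_grid(ybar, grid, draws, n_rep=1):
    """Average the conditional likelihood over posterior draws at each grid value of lam.

    `draws` is a list of (sigma, u) pairs, where u is a list of local scales.
    The result is rescaled so that its maximum over the grid equals 1.
    """
    log_curve = []
    for lam in grid:
        logs = [conditional_log_lik(ybar, lam, s, u, n_rep) for (s, u) in draws]
        top = max(logs)
        # log of the average likelihood, computed stably via log-sum-exp
        log_curve.append(top + math.log(sum(math.exp(v - top) for v in logs) / len(logs)))
    peak = max(log_curve)
    return [math.exp(v - peak) for v in log_curve]
```

The returned curve can be plotted against `grid` next to candidate prior densities for $\lambda$.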

This marginal likelihood has been renormalized to obtain a maximum of 1 and then
plotted alongside two alternatives: a half-Cauchy prior for $\lambda$, and the prior induced by
assuming that $\lambda^2 \sim \mathrm{IG}(1/2, 1/2)$. Under the inverse-gamma prior, there will clearly be an
inappropriate biasing of $p(\lambda)$ away from zero, which will negatively affect the ability of
the model to handle sparsity efficiently. For data sets with even more "noise" entries in
$y$, the distorting effect of a supposedly "default" inverse-gamma prior will be even more
pronounced, as the marginal likelihood will favor values of $\lambda$ very near zero (along with a
small handful of very large $u_i^2$ terms).



5 Discussion 



On strictly Bayesian grounds, the half-Cauchy is a sensible default prior for scale
parameters in hierarchical models: it tends to a constant at $\lambda = 0$; it is quite heavy-tailed; and
it leads to simple conjugate MCMC routines, even in more complex settings. All these
desirable features are summarized by Gelman (2006). Our results give a quite different,
classical justification for this prior in high-dimensional settings: its excellent quadratic risk
properties. The fact that two independent lines of reasoning both lead to the same prior is
a strong argument in its favor as a default proper prior for a shared variance component.
We also recommend scaling the $\beta_i$'s by $\sigma$, as reflected in the hierarchical model from the
introduction. This is the approach taken by Jeffreys (1961, Section 5.2), and we cannot
improve upon his arguments.

In addition, our hypergeometric inverted-beta class provides a useful generalization of
the half-Cauchy prior, in that it allows for greater control over global shrinkage through $\tau$
and $s$. It leads to a large family of estimators with a wide range of possible behavior, and
generalizes the form noted by Maruyama (1999), which contains the positive-part
James-Stein estimator as a limiting, improper case. Further study of this class may yield interesting
frequentist results, quite apart from the Bayesian implications considered here. The
expressions for marginal likelihoods also have connections with recent work on generalized
g-priors (Maruyama and George 2010; Polson and Scott 2011b). Finally, all estimators
arise from proper priors on $\lambda^2$, and will therefore be admissible.

There are still many open issues in default Bayes analysis for hierarchical models that
are not addressed by our results. One issue is whether to mix further over the scale in the
half-Cauchy prior, $\lambda \sim C^+(0, \tau)$. One possibility here is simply to let $\tau \sim C^+(0, 1)$. We
then get the following default "double" half-Cauchy prior for $\lambda$:

$$
p(\lambda) = \frac{4}{\pi^2} \int_0^\infty \frac{1}{1+\tau^2} \, \frac{1}{\tau (1 + \lambda^2/\tau^2)} \, d\tau = \frac{4}{\pi^2} \, \frac{\ln|\lambda|}{\lambda^2 - 1} .
$$
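The double half-Cauchy density can be verified numerically by mixing the $C^+(0, \tau)$ density over $\tau \sim C^+(0, 1)$ and comparing the result with the closed form $(4/\pi^2) \ln\lambda / (\lambda^2 - 1)$. The sketch below is an illustrative check with hypothetical function names: it uses the substitution $\tau = \tan\theta$, which maps the integral onto $[0, \pi/2]$, followed by Simpson's rule.

```python
import math


def double_half_cauchy_closed_form(lam):
    """(4 / pi^2) * ln(lam) / (lam^2 - 1), with the limiting value 2 / pi^2 at lam = 1."""
    if lam == 1.0:
        return 2.0 / math.pi ** 2
    return (4.0 / math.pi ** 2) * math.log(lam) / (lam * lam - 1.0)


def double_half_cauchy_quadrature(lam, n=2000):
    """Integrate p(lam | tau) p(tau) over tau > 0 by Simpson's rule.

    With tau = tan(theta), the product of the two half-Cauchy densities becomes
    (4 / pi^2) * sin(theta) cos(theta) / (sin^2(theta) + lam^2 cos^2(theta)) d theta.
    """
    h = (math.pi / 2.0) / n
    total = 0.0
    for k in range(n + 1):
        weight = 1.0 if k in (0, n) else (4.0 if k % 2 else 2.0)
        s, c = math.sin(k * h), math.cos(k * h)
        total += weight * s * c / (s * s + lam * lam * c * c)
    return (4.0 / math.pi ** 2) * total * h / 3.0
```

The two functions should agree for any $\lambda > 0$, and the closed form grows without bound as $\lambda \to 0$.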



Admittedly, it is difficult to know where to stop in this "turtles all the way down" approach
to mixing over hyperparameters. (Why not, for example, mix still further over a scale
parameter for $\tau$?) Even so, this prior has a number of appealing properties. It is proper,
and therefore leads to a proper posterior; it is similar in overall shape to Jeffreys' prior; and
it is unbounded at the origin, and will therefore not down-weight the marginal likelihood
as much as the half-Cauchy for near-sparse configurations of $\beta$. The implied prior on the
shrinkage weight $\kappa$ for the double half-Cauchy is

$$
p(\kappa) \propto \ln\left( \frac{1-\kappa}{\kappa} \right) \frac{1}{1 - 2\kappa} \, \frac{1}{\sqrt{\kappa (1-\kappa)}} .
$$

This is like the horseshoe prior on the shrinkage weight (Carvalho et al. 2010), but with
an extra factor that comes from the fact that one is letting the scale itself be random with a
$C^+(0, 1)$ prior.

We can also transform to the real line by letting $\psi = \ln \lambda^2$. For the half-Cauchy prior
$p(\lambda) \propto 1/(1+\lambda^2)$ this transformation leads to

$$
p(\psi) \propto \frac{e^{\psi/2}}{1 + e^{\psi}} = \frac{1}{e^{\psi/2} + e^{-\psi/2}} = \frac{1}{2} \, \mathrm{sech}\left( \frac{\psi}{2} \right) .
$$

This is the hyperbolic secant distribution, which may provide fertile ground for further
generalizations or arguments involving sensible choices for a default prior.

A more difficult issue concerns the prior scaling for $\lambda$ in the presence of unbalanced
designs, that is, when $y_{ij} \sim \mathrm{N}(\beta_i, \sigma^2)$ for $j = 1, \ldots, n_i$, and the $n_i$'s are not necessarily equal.
In this case most formal non-informative priors for $\lambda$ (e.g. the reference prior for a particular
parameter ordering) involve complicated functions of the $n_i$'s (see, e.g., Yang and Berger
1997). These expressions emerge from a particular mathematical formalism that, in turn,
embodies a particular operational definition of "non-informative."

We have focused on default priors that occupy a middle ground between formal
non-informative analysis and pure subjective Bayes. This is clearly an important situation for
the many practicing Bayesians who do not wish to use noninformative priors, whether for
practical, mathematical, or philosophical reasons. An example of a situation in which
formal noninformative priors for $\lambda$ should not be used on mathematical grounds is when $\beta$
is expected to be sparse; see Scott and Berger (2006) for a discussion of this issue in the
context of multiple testing. It is by no means obvious how, or even whether, the $n_i$'s should
play a role in scaling $\lambda$ within this (admittedly ill-defined) paradigm of "default" Bayes.

Finally, another open issue is the specification of default priors for scale parameters in 
non-Gaussian models. For example, in logistic regression, the likelihood is highly sensitive 
to large values of the underlying linear predictor. It is therefore not clear whether something
so heavy-tailed as the half-Cauchy is an appropriate prior for the global scale term for 
logistic regression coefficients. All of these issues merit further research. 



A Details for computing moments and marginals 



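The normalizing constant in (1) is

$$
C = \int_0^1 \kappa^{a-1} (1-\kappa)^{b-1} \left\{ \frac{1}{\tau^2} + \left( 1 - \frac{1}{\tau^2} \right) \kappa \right\}^{-1} \exp(-s\kappa) \, d\kappa .
$$

Let $\eta = 1 - \kappa$. Using the identity that $e^x = \sum_{m=0}^{\infty} x^m / m!$, we obtain

$$
C = e^{-s} \sum_{m=0}^{\infty} \frac{s^m}{m!} \int_0^1 \eta^{b+m-1} (1-\eta)^{a-1} \{ 1 - (1 - 1/\tau^2) \eta \}^{-1} \, d\eta . \qquad (9)
$$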



Using properties of the hypergeometric function ${}_2F_1$ (Abramowitz and Stegun 1964, §15.1.1
and §15.3.1), this becomes, after some straightforward algebra,

$$
C = e^{-s} \, \mathrm{Be}(a, b) \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} \frac{(b)_{m+n} (1)_n}{(a+b)_{m+n}} \, \frac{s^m (1 - 1/\tau^2)^n}{m! \, n!} , \qquad (10)
$$

where $(a)_n$ is the rising factorial. Appendix C of Gordy (1998) proves that, for all $a > 0$,
$b > 0$, and $1/\tau^2 > 0$, the nested series in (10) converges to a positive real number, yielding

$$
C = e^{-s} \, \mathrm{Be}(a, b) \, \Phi_1(b, 1, a+b, s, 1 - 1/\tau^2) , \qquad (11)
$$

where $\Phi_1$ is the degenerate hypergeometric function of two variables (Gradshteyn and
Ryzhik 1965, 9.261).



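The $\Phi_1$ function can be written as a double hypergeometric series,

$$
\Phi_1(\alpha, \beta; \gamma; x, y) = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} \frac{(\alpha)_{m+n} (\beta)_n}{(\gamma)_{m+n} \, m! \, n!} \, x^m y^n , \qquad (12)
$$

where $(c)_n$ is the rising factorial. We use three different representations of $\Phi_1(\alpha, \beta, \gamma, x, y)$
for handling different combinations of arguments, all from Gordy (1998). When $0 \le y < 1$
and $x \ge 0$,

$$
\Phi_1(\alpha, \beta, \gamma, x, y) = \sum_{n=0}^{\infty} \frac{(\alpha)_n}{(\gamma)_n} \, \frac{x^n}{n!} \, {}_2F_1(\beta, \alpha + n; \gamma + n; y) . \qquad (13)
$$

When $0 \le y < 1$ and $x < 0$,

$$
\Phi_1(\alpha, \beta, \gamma, x, y) = e^{x} \sum_{n=0}^{\infty} \frac{(\gamma - \alpha)_n}{(\gamma)_n} \, \frac{(-x)^n}{n!} \, {}_2F_1(\beta, \alpha; \gamma + n; y) . \qquad (14)
$$

Finally, when $y < 0$,

$$
\Phi_1(\alpha, \beta, \gamma, x, y) = e^{x} (1-y)^{-\beta} \, \Phi_1(\tilde{\alpha}, \beta, \gamma, -x, \tilde{y}) , \qquad (15)
$$

where $\tilde{\alpha} = \gamma - \alpha$ and $\tilde{y} = y/(y-1)$. Then either (13) or (14) may be used to evaluate the
righthand side of (15), depending on the sign of $x$.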
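Representation (13) translates directly into a short routine. The Python sketch below is a minimal illustration, not Gordy's or the authors' implementation: it evaluates ${}_2F_1$ by its Gauss series (which assumes $|y| < 1$), sums (13) until the terms are negligible, and uses the result to compute the normalizing constant in (11). It covers only the case $0 \le y < 1$ and $x \ge 0$, that is, $s \ge 0$ and $\tau^2 \ge 1$; the function names and tolerances are assumptions.

```python
import math


def hyp2f1(a, b, c, z, tol=1e-15, max_terms=5000):
    """Gauss series for 2F1(a, b; c; z); intended for |z| < 1."""
    term = total = 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1.0)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def phi1(alpha, beta, gamma, x, y, tol=1e-15, max_terms=500):
    """Phi_1(alpha, beta, gamma, x, y) via representation (13); needs 0 <= y < 1, x >= 0."""
    total = 0.0
    coef = 1.0  # running value of (alpha)_n / (gamma)_n * x^n / n!
    for n in range(max_terms):
        term = coef * hyp2f1(beta, alpha + n, gamma + n, y)
        total += term
        if n > 0 and abs(term) < tol * abs(total):
            break
        coef *= (alpha + n) / (gamma + n) * x / (n + 1.0)
    return total


def normalizing_constant(a, b, s, tau2):
    """C = exp(-s) Be(a, b) Phi_1(b, 1, a + b, s, 1 - 1/tau2), as in (11)."""
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(-s + log_beta) * phi1(b, 1.0, a + b, s, 1.0 - 1.0 / tau2)
```

Two special cases give simple checks: with $x = 0$ the function reduces to ${}_2F_1(\alpha, \beta; \gamma; y)$, and with $y = 0$ it reduces to the confluent function ${}_1F_1(\alpha; \gamma; x)$. Handling $x < 0$ or $y < 0$ would require (14) and (15), which are not implemented here.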



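B Proof of Proposition 1

Proof. Begin with Stein's decomposition of risk, writing $m(y)$ for the marginal density $p(y)$
and dropping the constant factor $(2\pi)^{-p/2}$, which cancels in every ratio below. Following
Equation (10) of Fourdrinier et al. (1998), we have

$$
\|\nabla m(y)\| = \|y\| \int_0^1 \kappa^{p/2+1} p(\kappa) e^{-\frac{Z}{2} \kappa} \, d\kappa .
$$

The score can be written as

$$
\frac{\|\nabla m(y)\|}{m(y)} = \|y\| \, \frac{m_{p+2}(Z)}{m_p(Z)} = \|y\| \, E(\kappa \mid Z) ,
$$

and the Laplacian term is $\Delta m(y) = \int_0^1 (Z\kappa - p) \kappa^{p/2+1} p(\kappa) e^{-\frac{Z}{2} \kappa} \, d\kappa$. Combining these terms,
we have

$$
\frac{\Delta m(y)}{m(y)} = \frac{\int_0^1 (Z\kappa - p) \kappa^{p/2+1} p(\kappa) e^{-\frac{Z}{2} \kappa} \, d\kappa}{\int_0^1 \kappa^{p/2} p(\kappa) e^{-\frac{Z}{2} \kappa} \, d\kappa}
= Z \frac{m_{p+4}(Z)}{m_p(Z)} - p \, \frac{m_{p+2}(Z)}{m_p(Z)} .
$$

The risk term $\Delta \sqrt{m(y)} / \sqrt{m(y)}$ is then computed using the identity

$$
\frac{\Delta \sqrt{m(y)}}{\sqrt{m(y)}} = \frac{1}{2} \left\{ \frac{\Delta m(y)}{m(y)} - \frac{1}{2} \left( \frac{\|\nabla m(y)\|}{m(y)} \right)^2 \right\} ,
$$

which reduces to

$$
\frac{\Delta \sqrt{m(y)}}{\sqrt{m(y)}} = \frac{1}{2} \left\{ Z \frac{m_{p+4}(Z)}{m_p(Z)} - p \, g(Z) - \frac{Z}{2} g(Z)^2 \right\}
$$

for $g(Z) = E(\kappa \mid Z)$.

Secondly, note that

$$
Z \{ m_{p+2}(Z) - m_{p+4}(Z) \} = 2 \int_0^1 \kappa^{p/2+1} (1-\kappa) p(\kappa) \, d\left\{ -e^{-\frac{Z}{2} \kappa} \right\} .
$$

Therefore,

$$
Z \left\{ \frac{m_{p+2}(Z)}{m_p(Z)} - \frac{m_{p+4}(Z)}{m_p(Z)} \right\}
= \frac{1}{m_p(Z)} \int_0^1 \left\{ (p+2)(1-\kappa) - 2\kappa + 2\kappa(1-\kappa) \frac{p'(\kappa)}{p(\kappa)} \right\} \kappa^{p/2} e^{-\frac{Z}{2} \kappa} p(\kappa) \, d\kappa . \qquad (16)
$$

Then under the assumption that $\lim_{\kappa \to 0, 1} \kappa (1-\kappa) p(\kappa) = 0$, integration by parts gives (8).
Hence

$$
E(\|\hat{\beta} - \beta\|^2) = p + 2 E_{Z \mid \beta} \left[ (Z + 4) \, g(Z) - (p+2) - \frac{Z}{2} g(Z)^2 - E_{\kappa \mid Z} \left\{ 2\kappa(1-\kappa) \frac{p'(\kappa)}{p(\kappa)} \right\} \right] .
$$

□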



References 

M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions With Formulas,
Graphs, and Mathematical Tables, volume 55 of Applied Mathematics Series. National Bureau
of Standards, Washington, DC, 1964. Reprinted in paperback by Dover (1974); on-line at
http://www.math.sfu.ca/~cbm/aands/

J. O. Berger. A robust generalized Bayes estimator and confidence region for a multivariate normal
mean. The Annals of Statistics, 8(4):716-761, 1980.

C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals.
Biometrika, 97(2):465-80, 2010.

D. Fourdrinier, W. Strawderman, and M. T. Wells. On the construction of Bayes minimax estimators.
The Annals of Statistics, 26(2):660-71, 1998.

A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis,
1(3):515-33, 2006.

E. I. George, F. Liang, and X. Xu. Improved minimax predictive densities under Kullback-Leibler
loss. The Annals of Statistics, 34(1):78-91, 2006.

M. B. Gordy. A generalization of generalized beta distributions. Finance and Economics Discussion
Series 1998-18, Board of Governors of the Federal Reserve System (U.S.), 1998.

I. Gradshteyn and I. Ryzhik. Table of Integrals, Series, and Products. Academic Press, 1965.

J. Griffin and P. Brown. Alternative prior distributions for variable selection with very many more
variables than observations. Technical report, University of Warwick, 2005.

H. Jeffreys. Theory of Probability. Oxford University Press, 3rd edition, 1961.

Y. Maruyama. Improving on the James-Stein estimator. Statistics and Decisions, 14:137-40, 1999.

Y. Maruyama and E. I. George. gBF: A fully Bayes factor with a generalized g-prior. Technical
report, University of Tokyo, arXiv:0801.4410v2, 2010.

C. Morris and R. Tang. Estimating random effects via adjustment for density maximization.
Statistical Science, 26(2):271-87, 2011.

N. G. Polson and J. G. Scott. Shrink globally, act locally: sparse Bayesian regularization and
prediction. In Proceedings of the 9th Valencia World Meeting on Bayesian Statistics. Oxford
University Press, 2011a.

N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes, and regularized regression.
Journal of the Royal Statistical Society (Series B), (to appear), 2011b.

J. G. Scott and J. O. Berger. An exploration of aspects of Bayesian multiple testing. Journal of
Statistical Planning and Inference, 136(7):2144-2162, 2006.

C. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics,
9:1135-51, 1981.

W. Strawderman. Proper Bayes minimax estimators of the multivariate normal mean. The Annals
of Mathematical Statistics, 42:385-8, 1971.

G. C. Tiao and W. Tan. Bayesian analysis of random-effect models in the analysis of variance, I.
Posterior distribution of variance components. Biometrika, 51:37-53, 1965.

R. Yang and J. O. Berger. A catalog of noninformative priors. Technical Report 42, Duke University
Department of Statistical Science, 1997.